TITLE by Aesha Alshammri

Introduction: This report explores a tidy red wine quality data set that contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine; the ratings in this data set range from 3 (very bad) to 8 (very excellent).

Univariate Plots Section

In this section, I perform some preliminary exploration of the dataset: I run some summaries of the data and create univariate plots to understand the structure of the individual variables in the dataset.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

1- Quality plot

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The quality variable is treated as categorical data; the minimum value is 3 (very bad quality) and the maximum value is 8 (very excellent quality). Most wines are rated 5 or 6, which is neither very bad nor very excellent.

2- Plots of all 11 variables on the chemical properties

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The maximum fixed acidity is 15.90, a value that few wines reach; most wines have a fixed acidity of about 7.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The maximum volatile acidity is 1.5800, a value that few wines reach; most wines have a volatile acidity of about 0.7.

Citric acid is right-skewed: most wines have low citric acid.

summary(pf$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The residual sugar data have a long tail, so we will transform the data to shorten the tail.

After transforming the data with log10, the distribution is much easier to read; most wines have a residual sugar of about 2.5.
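As a minimal sketch of why the log10 transform shortens the tail, the example below uses only the first ten `residual.sugar` values shown in the `str()` output above, not the full data set. The helper `tail_ratio` is a hypothetical name introduced here for illustration; when plotting with ggplot2, `scale_x_log10()` applies the same transform to the x-axis.

```r
# First ten residual.sugar values, copied from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# log10 compresses large values more than small ones, pulling in the right tail.
log_sugar <- log10(residual_sugar)

# Hypothetical helper: distance of the maximum from the median,
# measured in units of the interquartile range.
tail_ratio <- function(x) (max(x) - median(x)) / IQR(x)

tail_ratio(residual_sugar)  # raw scale
tail_ratio(log_sugar)       # log10 scale: smaller, so the tail is shorter
```

Ten rows are far too few to judge the shape of the full distribution; the point is only that the extreme value (6.1) sits much closer to the bulk of the data after the transform.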

Chlorides also have a long tail, as above, so we will transform the data using log10.

The distribution now looks approximately normal, and most wines have about 0.08 chlorides.

Free sulfur dioxide is right-skewed: most wines have a free sulfur dioxide below 20.

Total sulfur dioxide is right-skewed with a long tail, so let's transform the data with log10.

The distribution now looks approximately normal and is much easier to read; most wines have a total sulfur dioxide of about 50.

The distribution of density is approximately normal, and most wines have a density of about 0.997.

The distribution of pH is approximately normal, and most wines have a pH of about 3.4.

Sulphates have a long tail, so we will transform the data with log10 as below.

The distribution is now close to normal; most wines have about 0.7 sulphates.

Most wines have about 9.5 alcohol.

Univariate Analysis

What is the structure of your dataset?

The dataset has 11 variables describing the chemical properties of the wine, plus the output variable quality (“quality of wine”), which contains scores 3-8 (3 “very bad”, 4 “bad”, 5 “good”, 6 “very good”, 7 “excellent”, 8 “very excellent”). The data types in the dataset are numeric and integer. The quality variable is treated as categorical.
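One way to treat the integer quality score as a categorical variable is to convert it to an ordered factor. The sketch below is an assumption about how this could be done, not the code used for this report, and it uses only the first ten quality values from the `str()` output above.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Ordered factor with one level for every score observed in the full
# data set (3 to 8), so that unused scores still appear with a count of 0.
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

# Number of wines per quality score.
counts <- table(quality_f)
counts
```

Keeping all six levels makes small groups visible, which matters here because the extreme scores are rare.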

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality, together with the 11 chemical property variables: which of the 11 variables influence quality the most?

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The 11 variables of chemical properties.

Did you create any new variables from existing variables in the dataset?

No, there was no need.

Of the features you investigated, were there any unusual distributions?

The residual.sugar, chlorides, total.sulfur.dioxide, and sulphates histograms were too long-tailed, so I applied a log10 transform to the x-axis; the distributions then became closer to normal and easier to read.

Bivariate Plots Section

First, I need to look at the correlation coefficients between all variables using a correlation matrix.

Note that a correlation coefficient:

- between 0 and 0.3 (0 and -0.3) indicates a weak positive (negative) linear relationship.
- between 0.3 and 0.7 (-0.3 and -0.7) indicates a moderate positive (negative) linear relationship.
- between 0.7 and 1.0 (-0.7 and -1.0) indicates a strong positive (negative) linear relationship.

As the correlation matrix shows, most relationships are moderate. There are 7 moderate positive relationships.
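The thresholds above can be sketched as a small helper function. The name `classify_r` is hypothetical, and placing the boundary values 0.3 and 0.7 in the higher category is an assumption, since the ranges as stated overlap at those points.

```r
# Hypothetical helper: label a correlation coefficient by strength and direction.
# Assumption: exactly 0.3 counts as "moderate" and exactly 0.7 as "strong".
classify_r <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1))
  strength <- ifelse(abs(r) < 0.3, "weak",
              ifelse(abs(r) < 0.7, "moderate", "strong"))
  # sign(r) is -1, 0 or 1; shift it to index 1, 2 or 3.
  direction <- c("negative", "none", "positive")[sign(r) + 2]
  paste(strength, direction)
}

# The two correlations with quality reported in this document:
# alcohol (0.48) and volatile acidity (-0.39).
classify_r(c(0.48, -0.39))
```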

Now I need to look at scatter plots of the variables that have a high degree of association or correlation.

fixed acidity + citric acid plot

The relationship between fixed acidity (tartaric acid) and citric acid is positive. The densest clusters are between 7 - 8 fixed acidity and 0.00 - 0.10 citric acid.

The relationship between fixed acidity (tartaric acid) and density is positive. The densest clusters are between 7 - 8 fixed acidity and 0.995 - 0.998 density.

The relationship between fixed acidity (tartaric acid) and pH is negative. The densest clusters are between 6 - 8.5 fixed acidity and 3.2 - 3.4 pH.

The relationship between volatile acidity and citric acid is negative.

The relationship between density and alcohol is negative. The densest clusters are between 0.996 - 0.998 density and 9.5 - 10 alcohol.

Plots of the two main variables with the highest degree of association with quality

1- quality + alcohol

Alcohol is positively associated with quality: wines with higher alcohol tend to have higher quality, as can be seen for qualities 8 and 7.

2- quality + volatile acidity

Volatile acidity is negatively associated with quality: wines with higher volatile acidity tend to have lower quality.
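The two associations above can also be checked numerically by summarising each property within each quality score. The sketch below is illustrative only: it uses the first ten rows from the `str()` output rather than the full data set, so it shows the mechanics and not the full-data trend.

```r
# First ten rows of the three relevant columns, copied from the str() output above.
wines <- data.frame(
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Median alcohol and median volatile acidity for each quality score.
by_quality <- aggregate(cbind(alcohol, volatile.acidity) ~ quality,
                        data = wines, FUN = median)
by_quality
```

On the full data, a rising median alcohol and a falling median volatile acidity across quality scores would match the positive and negative associations described above.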

pH is positively associated with alcohol (a small correlation).

pH is positively associated with volatile acidity (a small correlation).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

All variables have moderate to weak relationships. There are two moderate relationships with the feature of interest (quality):

1- a positive relationship with alcohol (0.48)
2- a negative relationship with volatile acidity (-0.39)

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Since alcohol is positively associated with quality and volatile acidity is negatively associated with quality, I focused on the variables most highly correlated with alcohol and volatile acidity, and on what those variables correlate with in turn.

First, alcohol: its highest correlation is with density (negative); the highest correlation with density is fixed acidity (positive); and the highest correlation with fixed acidity is pH (negative).

Second, volatile acidity: its highest correlation is with citric acid (negative); the highest correlation with citric acid is fixed acidity (positive); and the highest correlation with fixed acidity is pH (negative).

What I need is one or more variables that have a positive relationship with alcohol and a negative relationship with volatile acidity, and so are associated with higher quality. Going back to the correlation matrix, I found that citric acid meets this requirement: a higher citric acid level is associated with a small increase in alcohol and with a decrease in volatile acidity.

The other variable I found that may increase quality is sulphates: as citric acid increases, sulphates increase, and sulphates have a positive (though weak) relationship with quality, so they may contribute to higher quality.
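The search described here (a variable correlated positively with alcohol and negatively with volatile acidity) can be sketched as a filter on the correlation matrix. This is a minimal illustration on the first ten rows from the `str()` output, with density left out because only five of its values are printed there; the result on ten rows need not match the full-data finding.

```r
# First ten rows of six columns, copied from the str() output above.
wines <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Pearson correlation matrix of all columns.
r <- cor(wines)

# Every variable other than the two reference variables is a candidate.
candidates <- setdiff(colnames(r), c("alcohol", "volatile.acidity"))

# Keep candidates with r > 0 against alcohol and r < 0 against volatile acidity.
keep <- r[candidates, "alcohol"] > 0 & r[candidates, "volatile.acidity"] < 0
selected <- candidates[keep]
selected
```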

What was the strongest relationship you found?

The strongest relationship is between pH and fixed acidity (-0.68), which is a negative relationship.

Multivariate Plots Section

If the color becomes dark blue with a filled shape (qualities 5, 4, 3), the quality is low or bad, and vice versa. As we can see in the scatterplot above, most points become dark with a filled shape when alcohol decreases and volatile acidity increases.

ggplot(pf, aes(x = volatile.acidity, y = citric.acid))+
  geom_point(aes(color = factor(quality), shape=factor(quality)), size = 2.5)

The unfilled shapes (qualities 6, 7, 8), mostly the shape for quality 7, appear when citric acid increases and volatile acidity decreases.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

-Alcohol and volatile acidity are related to quality: most wines with quality below 5 have 9-10 alcohol and 0.4-0.8 volatile acidity, and quality increases as alcohol increases and volatile acidity decreases. This is most visible for quality 7.

-Citric acid and volatile acidity are related to quality: quality increases as citric acid increases and volatile acidity decreases. Lower volatile acidity is also related to higher alcohol, which in turn goes with higher quality.

Were there any interesting or surprising interactions between features?

Yes. The qualities that most clearly rise and fall with the properties are 7 and 5, while quality 6 is in disarray in the alcohol and volatile acidity relationship.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No. The limitation is the sample size of each quality.


Final Plots and Summary

Plot One

Description One

I used a histogram to see the distribution of quality. The histogram shows that most wines are of quality 5 “good” and that few wines are of quality 3 “very bad” or 8 “very excellent”.

Plot Two

Description Two

I used a scatterplot to see the strength and direction of the relationship between quality and alcohol. The relationship between quality and alcohol is clearly positive and of moderate strength (0.48).

Plot Three

Description Three

I used a scatterplot with a color and shape for each quality score to see where each quality falls across the two properties, alcohol and volatile acidity. As shown above, quality 6 is in disarray, and quality 5 wines have alcohol of 9-10 and volatile acidity of 0.4-0.7.


Reflection

First, I looked at which quality is most common and found that it is quality 5. Second, I looked at why, by finding which properties are most related to quality, and found two: alcohol (positive relationship) and volatile acidity (negative relationship). Quality 5 wines mostly have alcohol of 9-10 units and volatile acidity of 0.4-0.7. As the last plot shows, qualities 7 and 8 appear mostly when alcohol increases and are not affected very much by volatile acidity, where they go up and down. The struggle and the surprise were with quality 6: it was in disarray, and it is the second most common quality after 5.